HOME CREDIT DEFAULT RISK

FP_GROUP18_HCDR


Team Members:


amlpic.png

ABSTRACT:

The project that we are working on is https://www.kaggle.com/c/home-credit-default-risk/

Many loan applicants are denied primarily because they have poor or insufficient credit histories. These applicants then turn to untrustworthy lenders for financial support, and they risk being taken advantage of, most often through unreasonably high interest rates. To address this issue, "Home Credit", a lending agency founded in 1997 and spread across 8 countries, strives to broaden financial inclusion for the unbanked population by providing easy, fast, and safe borrowing. In this project, we aim to predict a borrower's ability to repay a loan by training predictive machine learning models (Naive Bayes, Logistic Regression, Random Forest, and Stochastic Gradient Descent) on historical loan application data. The models will be evaluated on ROC AUC, accuracy, confusion matrix, log loss, and F1 score.

MACHINE LEARNING MODELLING:

To predict whether an applicant will repay a loan using the historical data, we apply the following algorithms and aim for the most accurate results:

METRICS:


DATASETS:

Background on the dataset:

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

application_{train|test}.csv:


amldata.png

TASKS:

Phase 2:

Phase 3:

Workflow of the tasks for Phase 2 & Phase 3:

Phase3%20workflow.png

DATA DESCRIPTION

Home Credit uses various kinds of data, and we have been provided with 9 datasets that will help us analyze and classify clients based on the risk of non-repayment. The application train and test files contain data about loan applicants at the time of filing the application, and are used to train and test our models. bureau.csv has data about previous loans taken from other financial institutions that were reported to the Credit Bureau, with one row per loan provided to the client. bureau_balance.csv has monthly balance data for the client's previous Credit Bureau-reported loans. POS_CASH_balance.csv contains monthly snapshots of a client's previous cash and point-of-sale loan balances, with one row per month of history for each previous loan. credit_card_balance.csv consists of monthly balance snapshots of the client's previous credit cards with Home Credit, with one row per month of history per card. previous_application.csv shows all past applications for Home Credit loans. installments_payments.csv shows the repayment history for previously disbursed loans, where each row records a payment that was made or missed. HomeCredit_columns_description.csv describes the columns in the other dataset files.

The application train dataset has 122 columns and 307,511 rows: 121 features, of which 105 are numerical and 16 are categorical. We aim to examine whether individuals with no credit history qualify for a home loan. An individual might lack a credit history but still require a home loan; in such cases, loan approval on the basis of credit rating is not a viable option. Our project will therefore look at all the other factors for an individual.
From monthly income to previous loan applications, we will cover all the financial aspects of the individual and classify them as either 'no risk' individuals or 'credit risk' individuals. In this phase we tackle tasks such as understanding the features of the raw data, performing EDA and pre-processing, splitting the dataset into train, validation, and test sets, training baseline random forest and logistic regression models, and analysing their performance.

We perform the following tasks:

Data description using pandas DataFrame: pandas.DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). The methods used to describe data from the dataframes are:

Elimination of null/missing data: We determine the datatype of each feature and check for null/missing values and zero values in our data. After checking for missing values, we remove columns with more than 20% missing data. Similarly, we remove columns in which more than 80% of rows contain only zeros.

Data Segregation: We segregate the data into Categorical & Numerical Variables.

Data joining/merging: We join the features that have high correlation by identifying them from EDA.

Best feature extraction: Extracting the top important features for the model pipelining and hyperparameter tuning.
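The elimination and segregation steps above can be sketched in pandas as follows (a minimal illustration; the `preprocess` helper, the threshold defaults, and the toy column names are ours, not from the project code):

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, missing_thresh: float = 0.20, zero_thresh: float = 0.80):
    """Drop high-missing and mostly-zero columns, then split features by type."""
    # Drop columns whose fraction of missing values exceeds the threshold
    keep = df.columns[df.isna().mean() <= missing_thresh]
    df = df[keep]
    # Drop numeric columns where more than `zero_thresh` of rows are zero
    num = df.select_dtypes(include=np.number)
    mostly_zero = num.columns[(num == 0).mean() > zero_thresh]
    df = df.drop(columns=mostly_zero)
    # Segregate into numerical and categorical variables
    num_cols = df.select_dtypes(include=np.number).columns.tolist()
    cat_cols = df.select_dtypes(include="object").columns.tolist()
    return df, num_cols, cat_cols
```

The same thresholds would be applied once on the training split and reused unchanged on validation and test data.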

EXPLORATORY DATA ANALYSIS

Exploratory data analysis is the crucial process of doing preliminary analyses on data in order to find patterns, identify anomalies, test hypotheses, and double-check assumptions with the aid of summary statistics and graphical representations. EDA helps with a better understanding of the variables in the data collection and their relationships, and is usually used to investigate what data might disclose beyond the formal modeling or hypothesis testing assignment. It can also assist in determining the suitability of the statistical methods you are considering using for data analysis.
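A lightweight helper of the kind we use for these first-pass summaries might look as follows (a sketch; `quick_eda` and its return keys are hypothetical names, not project code):

```python
import pandas as pd

def quick_eda(df: pd.DataFrame, target: str) -> dict:
    """Summary statistics that guide the later pre-processing decisions."""
    return {
        "shape": df.shape,                                           # rows, columns
        "missing_fraction": df.isna().mean().sort_values(ascending=False),
        "target_balance": df[target].value_counts(normalize=True),   # class imbalance
        "numeric_summary": df.describe(),                            # mean, std, quartiles
    }
```

Checking `target_balance` first matters here: the default/non-default classes in this competition are highly imbalanced, which influences the choice of evaluation metrics later.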

FEATURE EXTRACTION

By creating new features from the existing ones (and then discarding the original features), Feature Extraction attempts to reduce the number of features in a dataset. The new reduced set of features will be able to summarize much of the information that was contained in the original set of features. Thus, an abridged version of the original features can be created by combining them.

In our analysis of the data, we found many missing values. Columns with more than 20% missing values were removed. Our team checked the columns for the distribution of 0's and removed the columns in which 85% of rows contained only 0's. In addition, we divided the data into numerical and categorical features. Numerical data was handled by an intermediate imputer pipeline in which missing values were replaced with the column mean, while categorical data was handled by one-hot encoding (OHE) and replacing missing values with the column mode.
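The mean/mode imputation and one-hot encoding described above map naturally onto a scikit-learn `ColumnTransformer`; a minimal sketch (the `make_preprocessor` name is ours):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def make_preprocessor(num_cols, cat_cols):
    """Impute numeric columns with the mean; categorical with the mode, then OHE."""
    numeric = SimpleImputer(strategy="mean")                      # mean imputation
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),      # mode imputation
        ("ohe", OneHotEncoder(handle_unknown="ignore")),          # one-hot encode
    ])
    # sparse_threshold=0 forces a dense array, convenient for small examples
    return ColumnTransformer(
        [("num", numeric, num_cols), ("cat", categorical, cat_cols)],
        sparse_threshold=0.0,
    )
```

Because the imputers are fit inside the transformer, the means and modes are learned from the training split only and reused on validation and test data.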

MODELLING PIPELINE

Applied machine learning is typically focused on finding a single model that performs well or best on a given dataset.

Effective use of the model will require appropriate preparation of the input data and hyperparameter tuning of the model.

Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the modeling pipeline. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction).

The below pipeline is for the project:

pipeline.jpg
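As a minimal sketch of such a pipeline in scikit-learn (the column names here are hypothetical stand-ins for the real application_train.csv features):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature names standing in for the real application_train.csv columns
num_cols, cat_cols = ["AMT_INCOME_TOTAL"], ["NAME_CONTRACT_TYPE"]

pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
], sparse_threshold=0.0)

# One estimator object: preparation and model are fit together, so cross-validation
# refits the imputers/encoders per fold and avoids data leakage
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
```

Swapping `LogisticRegression` for any of the other classifiers below changes only the final `"clf"` step; the data preparation stays identical across models.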

MACHINE LEARNING MODELS

Naive Bayes

The foundation of Naive Bayes is Bayes' theorem and the assumption of predictor independence. This approach is especially helpful with a dataset of this size because it makes building a model simple and does not require tedious iterative parameter estimation.

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. So, given an applicant's previous data, we can predict the probability of that applicant defaulting on the loan.
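A minimal sketch on toy numbers, using scikit-learn's `GaussianNB` (one of several Naive Bayes variants; the data here is fabricated for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two numeric features, label 1 = "defaulted"
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X, y)
# Posterior P(class | features) via Bayes' theorem with independent predictors
probs = nb.predict_proba(X)
```

The per-class probabilities in `probs` are exactly the quantity we care about here: the estimated chance that a given applicant defaults.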

Logistic Regression

Logistic regression predicts the likelihood of a dichotomous outcome from one or more predictors. The logistic curve produced by this technique can only take values between 0 and 1. Logistic regression is used when the dependent variable (target) is categorical.

The dependent variable in logistic regression follows a Bernoulli distribution, and estimation is done through maximum likelihood. There is no R-squared; model fitness is assessed through concordance and the KS statistic.
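A minimal sketch of these properties on toy data (fabricated numbers, scikit-learn's `LogisticRegression`, which fits by maximum likelihood):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: a single predictor with well-separated classes
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression().fit(X, y)   # maximum-likelihood fit
p = lr.predict_proba(X)[:, 1]         # logistic curve: values strictly in (0, 1)
```

Note that `p` increases monotonically with the predictor here, tracing out the S-shaped logistic curve between 0 and 1.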

Stochastic Gradient Descent

In Stochastic Gradient Descent, a few samples are selected randomly for each iteration instead of the whole dataset, which makes each update cheap at the cost of noisier gradient estimates.

image.png

Random Forest

Random forest is a supervised machine learning algorithm widely used in classification and regression problems. It builds decision trees on different samples and takes their majority vote for classification, or their average for regression. Random forest works on the bagging principle. Bagging, also known as bootstrap aggregation, is the ensemble technique used by random forest: it chooses random samples from the dataset with replacement (row sampling), and each model is generated from one of these bootstrap samples; this sampling step is called the bootstrap. Each model is then trained independently, and the final output is based on majority voting after combining the results of all models; this step of combining the results and generating output by majority vote is known as aggregation.

Steps involved in random forest algorithm:

randomforst%20final.png
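The bagging-and-voting procedure above can be sketched with scikit-learn's `RandomForestClassifier` on fabricated data (only feature 0 is informative by construction):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)   # only feature 0 matters

# bootstrap=True: each tree trains on a bootstrap sample (bagging);
# class predictions are aggregated by majority vote across the trees
rf = RandomForestClassifier(n_estimators=50, bootstrap=True, random_state=0).fit(X, y)
importances = rf.feature_importances_   # normalised to sum to 1
```

The `feature_importances_` attribute is also what we use later to select the top features for training.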

RESULT DISCUSSION

After building the models, we evaluate them on the following metrics.

METRICS:

Log Loss: Logarithmic Loss or Log Loss, works by penalising the false classifications. It works well for multi-class classification. When working with Log Loss, the classifier must assign probability to each class for all the samples.

Accuracy: Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples. It works well only if there are an equal number of samples belonging to each class.

Confusion Matrix: Confusion Matrix is a tabular visualization of the ground-truth labels versus model predictions. Each row of the confusion matrix represents the instances in a predicted class and each column represents the instances in an actual class. Confusion Matrix is not exactly a performance metric but sort of a basis on which other metrics evaluate the results.

F1-Score: F1 score is the harmonic mean of precision and recall, with a range of [0, 1]. It tells you how precise your classifier is, as well as how robust it is. High precision with lower recall gives an extremely accurate classifier, but one that misses a large number of instances that are difficult to classify. The greater the F1 score, the better the performance of our model.

ROC_AUC: The Receiver Operator Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
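All five metrics are available in scikit-learn; a small sketch with hand-made predictions (the numbers are fabricated for illustration):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.6])  # predicted P(default)
y_pred = (y_prob >= 0.5).astype(int)                # threshold at 0.5

results = {
    "log_loss": log_loss(y_true, y_prob),      # penalises confident mistakes
    "accuracy": accuracy_score(y_true, y_pred),
    "confusion": confusion_matrix(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),  # threshold-free ranking quality
}
```

Note that log loss and ROC AUC consume the predicted probabilities directly, while accuracy, the confusion matrix, and F1 depend on the chosen classification threshold.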

Logistic Regression Result

We performed logistic regression without regularization and achieved train and validation accuracy of ~92%. The ROC area under the curve for the logistic regression model is 0.735, showing a large proportion of true positives and indicating a good fit to the data.

KAGGLE SUBMISSION:

PHASE 3

FEATURE ENGINEERING AND FEATURE SELECTION:

The following steps will allow us to accomplish feature selection, engineering, and model selection:

Below are the newly engineered features, which are among the top 50 features used for training.

Extracting features from bureau.csv and bureau_balance.csv

Extracting features from installments_payments.csv

Extracting features from POS_CASH_balance.csv

The above graph is used to visualize the importance of top 50 features from the data. These top 50 features were further used to train our models.
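Aggregations of this kind collapse the auxiliary tables to one row per applicant before joining on the application data. A minimal sketch with columns modeled on bureau.csv (the toy rows and the engineered feature names are our own illustration):

```python
import pandas as pd

# Hypothetical bureau-style table: one row per previous loan per client
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_CREDIT_SUM": [1000.0, 2000.0, 500.0, 700.0, 300.0],
    "CREDIT_DAY_OVERDUE": [0, 10, 0, 0, 5],
})

# Collapse to one row per client, with summary statistics as new features
agg = bureau.groupby("SK_ID_CURR").agg(
    bureau_loan_count=("AMT_CREDIT_SUM", "count"),
    bureau_credit_mean=("AMT_CREDIT_SUM", "mean"),
    bureau_overdue_max=("CREDIT_DAY_OVERDUE", "max"),
).reset_index()
```

The resulting frame can then be left-merged onto the application table on `SK_ID_CURR`, and the same pattern applies to installments_payments.csv and POS_CASH_balance.csv.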

Logistic Regression

Random Forest

Random Forest Result

We have achieved a training accuracy of 92% and AUC score of 0.73 on test data.

Stochastic Gradient Descent

Stochastic GD Result

We have achieved a training accuracy of 92% and an AUC score of 0.72.

Kaggle Submission

MLP with PyTorch

TEAM AND PLAN UPDATES

Team Members

zoompic.png

Phase Leader Plan

phase%20leader%20plan.png

Credit Assignment Plan

creditasntplanphase3.png

CONCLUSION

In phase 2 of our HCDR project, we worked on feature engineering and selection, model selection, and hyperparameter tuning. We built several pipelines as a clean data-flow framework for our cross-validation-based decision-making process over ML algorithms and tunable parameters. Not entirely surprisingly, but still quite troublesome, the size of our data and the resulting computational demands introduced significant challenges that forced us to scale back and adjust the plans we originally had for this phase. We managed to achieve a training accuracy of 92% and an AUC score of 0.73 for random forest, the best model implemented so far. To further improve on this, we plan a variety of model and feature-selection refinements in the next phase, while also looking for potential remedies to the data-size-related challenges.